Gemini 1.5 Pro Overview

Google has launched Gemini 1.5 Pro, an advanced AI model that can handle various types of content, including text, video, and audio. This model is particularly skilled at understanding and answering questions about long documents, potentially containing millions of tokens. It represents a significant upgrade in answering questions from lengthy documents, videos, and audio streams. In fact, it matches or even surpasses its predecessor, Gemini 1.0 Ultra, in many standard tests and achieves over 99% accuracy in retrieving information from documents with up to 10 million tokens.

A new experimental version of Gemini 1.5 Pro is available for testing in Google AI Studio, allowing for a **1 million token context window**. This is a huge improvement over the previous limit of 200,000 tokens. With this expanded capability, users can now perform complex tasks like answering questions about large PDFs, extensive codebases, and long videos. The model can process various types of inputs—audio, visual, text, and code—all at once.

Model Architecture

Gemini 1.5 Pro is built on a sparse mixture-of-experts (MoE) architecture, which allows the model to grow in capability without requiring as much computational power. While specific technical details are limited, it has been reported that this model is more efficient, needing less computational power for training and is better at understanding long-context content (up to 10 million tokens). It has been trained on a variety of data types and has undergone tuning based on human preferences.

Performance Results

Gemini 1.5 Pro demonstrates impressive recall abilities across all content types—text, video, and audio—handling up to 1 million tokens effectively. To give you a sense of its capabilities, it can manage the following:

- Approximately 22 hours of recorded audio
- The equivalent of 10 books, each with 1440 pages
- Entire programming codebases
- 3 hours of video at a rate of 1 frame per second

In comparison to its predecessor, Gemini 1.5 Pro excels in various benchmarks, particularly in fields like Math, Science, Reasoning, Multilingual Understanding, Video Comprehension, and Code Analysis.

Key Capabilities

Gemini 1.5 Pro can process and analyze large datasets, demonstrating advanced multimodal reasoning. Here are a couple of examples:

Long Document Analysis

In Google AI Studio, users can upload entire PDFs of up to 1 million tokens. For example, if a PDF about a scientific paper is uploaded and the user asks, "What is the paper about?" the model can provide an accurate summary. Users can also use a chat format to ask multiple questions about the document, making it easier to extract information.

To test its ability further, users can upload multiple PDFs and ask questions that span both documents. For instance, if prompted to summarize details about a large language model mentioned in another paper, the model can extract relevant information, although it sometimes misattributes specific details.

Video Understanding

Gemini 1.5 Pro also shows strong video comprehension skills. For instance, if a user uploads a lecture video and asks, "What is the lecture about?" the model can provide a clear summary. When asked to outline the lecture's key points, the model generates a concise and accurate response, covering topics such as how large language models function and the resources needed for their training. However, it can occasionally generate inaccurate details, so it's important to verify specific information.

Code Reasoning

The model excels in analyzing code, too. Users can upload an entire codebase and ask the model questions about it. For example, when provided with the entire JAX codebase, it can identify the location of specific methods effectively.

Translation Capabilities

Gemini 1.5 Pro can even translate complex languages. By providing it with a comprehensive grammar manual, a dictionary, and parallel sentences for Kalamang—a lesser-known language—Gemini 1.5 Pro can produce translations that reflect the level of someone learning the language.